
a. To check Quality of Data collected During Survey
b. How Various factors defined during Survey impacting the valuation of Property under Survey in same area?
c. Disparity in the Property Valaution method
Kampala City Authority, Uganda (KCCA) Completed Property Survey, in order to complete the valuation of all properties under them. Property Tax is levied by them on Valuation amount. Property tax is one of the major Revenue head for them. Here I reviewed and analyzed samples of assessed/valued properties in Central Division and used this data (Property attributes and Ratable Values).
During the data collection process, about 14-16 attributes/Factors are captured on each and every property. These attributes can be broadly categorized as details on; Ownership, property location, Usage, Building, Construction, Accommodation and others. It should be noted that although comprehensive data is collected on every property, not all this data is used to make assessments for property Values. Some data is needed/used for administrative purposes for example ownership details.
The purpose of this data analysis is to check Quality of data collected and impact of attributes/ factors on earlier valuation value by KCCA.
Further to this analysis KCCA want to derive model to calculate Valuation (ALV / RV) with the useful Attributes / Factors
Attributes considered / collected during Survey are
1. Total Built Up Area
2. Type of Access
3. Neighborhood status
4. Building condition
5. Availability of NWSC
6. Availability of Hydro Power
7. Type of walls
8. Type of floor finish
9. Type of sanitation facility
10. Availability of Parking space
11. Security measure
12. Type of boundary wall
13. Type of wall finishes
14. Type of ceiling
15. Ownership
16. Usage of the Property
Data collected by KCCA during property Survey done during year 2017-18.
Sample data of one Division is taken for analysis out of 5 division.
import sys # Import packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import pie, axis, show
%matplotlib inline
import warnings # Ignore warning related to pandas_profiling
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100) # Display all dataframe columns in outputs
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True) # To apply seaborn styles to the plots.
# Making plotly specific imports
# These imports are necessary to use plotly offline without signing in to their website.
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import chart_studio.plotly as py
from plotly import tools
init_notebook_mode(connected=True)
# Install chart_studio
!pip install chart-studio
For Data Visualisation here we are using 2 dataset of KCCA - Property Valuation.
1. Property Master Data - This Dataset contains details about property Ownership details, Property Physical address along with its Latitutde and Longitude details
2. Property Valuation Data - This Dataset contains details about property details which has impact on Valuation and on Revenue of KCCA
pm = pd.read_csv("https://raw.githubusercontent.com/jaymalak/jaymala.kanagala-gmail.com-Term1-2-EDA-Project/master/DATA/Propmst.csv")
pv = pd.read_csv("https://raw.githubusercontent.com/jaymalak/jaymala.kanagala-gmail.com-Term1-2-EDA-Project/master/DATA/propval.csv")
pm.dtypes
pv.dtypes
pm.head()
pm.shape
pm.info()
. This Data set contains total 27 Columns and total 7924 Property details (row)
. There are 7 Numeric and 20 columns with Object type
. There are missing values under following coloumn
1. Prop_org_name
2. Owner first, last and middle name
3. Mobile Number
4. Email IDs
pv.head()
pv.info()
pv.shape
. This Data set contains total 22 Columns and total 7924 Property details (row) . There are 5 Numeric and 18 columns with Object type . There are no missing values
pm.describe(include = 'all')
pv.describe(include= 'all')
import pandas_profiling # Get a quick overview for all the variables using pandas_profiling
profilepm = pandas_profiling.ProfileReport(pm)
profilepm.to_file(output_file="D:\Jaymala\P\Datascience\Project\jaymala.kanagala@gmail.com-TERM 1&2 EDA Project\Before DataProcessing Profiling output\out_pm.html")
pm.profile_report(title='Pandas Profiling before Data Preprocessing', style={'full_width':True})
import pandas_profiling # Get a quick overview for all the variables using pandas_profiling
profilepv = pandas_profiling.ProfileReport(pv)
profilepv.to_file(output_file="D:\Jaymala\P\Datascience\Project\jaymala.kanagala@gmail.com-TERM 1&2 EDA Project\Before DataProcessing Profiling output\out_pv.html")
pv.profile_report(title='Pandas Profiling before Data Preprocessing', style={'full_width':True})
pv_rd = pv.round(decimals=3)
pv_rd.head()
pm['Perish'].value_counts() ### Perish wise Count of Properties
pm.groupby(['Perish','Village'])['Village'].count()
pm['Perish'].value_counts().plot(kind='pie',fontsize=10, autopct='%1.5f%%',figsize=(20,30), shadow=True, startangle=135, legend=True, cmap='summer')
plt.ylabel('Perish')
plt.title('Pie chart showing the proportion of each Perish value')
prop_merge = pd.merge(pm, pv, how='inner', left_on='Property_No._', right_on='Property_No._').drop(['CAMV_ID'], 1)
prop_merge.head()
prop_merge.info()
1. total 47 Columns
2. Eleven Columns are Numeric and 36 columns are of Object type
3. Missing Values are in Prop_ORG_name, Owner's First, middle and last name, Mobile Number, Email
prop_merge.describe()
prop_merge.shape
import pandas_profiling
befprof = prop_merge.profile_report(title='Pandas Profiling before Data Preprocessing')
befprof.to_file(output_file="D:\Jaymala\P\Datascience\Project\jaymala.kanagala@gmail.com-TERM 1&2 EDA Project\Before DataProcessing Profiling output\profiling_before_preprocessing_onmerged.html")
prop_merge.profile_report(title='Pandas Profiling before Data Preprocessing', style={'full_width':True})
. Number of Variables - 47
. Number of Observations - 7924
. Missing Cells - 29978 (8.0%)
. Numeric 8
. Categorical 28
. Boolean 1
. Text 2
. Rejected 8
prop_merge['Perish'].value_counts()
prop_merge.head()
prop_merge.drop(['Property_is_Legal', 'Division', 'Owner_Fname','Owner_Mname','Owner_Lname', 'Title', 'Plot_Area','Plot_No'], 1, inplace=True)
prop_merge.head()
prop_merge.shape
prop_merge = prop_merge.round(decimals=0)
prop_merge.head()
prop_merge.dtypes
prop_merge.Property_Tax = prop_merge.Property_Tax.astype(int)
prop_merge.dtypes
prop_merge.isnull().sum()
prop_merge.drop(['Mobile_No','Email_Id'], 1, inplace=True)
# To convert all Perish in upper case
prop_merge['Perish'] = prop_merge['Perish'].str.upper()
prop_merge['Perish'].value_counts()
prop_merge.head()
prop_merge.shape
# To output pandas profiling report to an external html file.
# Saving the output as profiling_after_preprocessing.html
aftprof = prop_merge.profile_report(title='Pandas Profiling after Data Preprocessing')
aftprof.to_file(output_file="D:\Jaymala\P\Datascience\Project\jaymala.kanagala@gmail.com-TERM 1&2 EDA Project\Before DataProcessing Profiling output\profiling_after_preprocessing.html")
prop_merge.profile_report(title='Pandas Profiling after Data Preprocessing', style={'full_width':True})
plt.figure(figsize = (15, 6))
prop_merge.groupby('Perish')['Property_No._'].count().sort_values(ascending=False).plot(kind='bar', color='g')
plt.ylabel('Count')
1. KAMWOKYA II has maximum number of PRoperties
plt.figure(figsize = (15, 6))
prop_merge.groupby('Perish')['Property_Tax'].sum().sort_values(ascending=False).plot(kind='bar', color='r')
plt.ylabel('Revenue')
## Perish and Village wise Property Count
prop_merge.groupby(['Perish','Village'])['Village'].count()
prop_merge.groupby(['Perish','Village'])['Property_No._'].count().sort_values(ascending=False)[:10]
plt.figure(figsize = (15, 6))
prop_merge.groupby(['Perish','Village'])['Property_No._'].count().sort_values(ascending=False)[:10].plot(kind='bar', color='g')
plt.ylabel('Count')
1. Village name within Perish is repeating. Same Village name "Church Zone found in mutiple Perishes.
2. Village Kisenyi II Zone of KAMWOKYA II Perish ahs Maximum number of proeprties.
3. Civic City Zone which generating Maximum Revenue for KCCA has only 3 villages and count of Properties (269) is also less as compare to other Zone
plt.figure(figsize = (15, 6))
prop_merge.groupby(['Perish','Village'])['Property_Tax'].sum().sort_values(ascending=False)[:10].plot(kind='bar', color='g')
plt.ylabel('Property Tax')
1. Villages from CIVIC centre genearting Maximum Revenue for KCCA.
prop_merge.groupby(['Perish','Village','Usage_Type'])['Property_No._'].count()
plt.figure(figsize=(15,5))
prop_merge[prop_merge['Usage_Type'] =='RESIDENTIALRENTED'].groupby(['Perish'])['Property_Tax'].sum().plot(kind='bar',color='blue',)
prop_merge[prop_merge['Usage_Type'] !='RESIDENTIALRENTED'].groupby(['Perish'])['Property_Tax'].sum().plot(kind='bar', color='green',fontsize=13)
plt.ylabel('Property_Tax')
plt.legend(['Residential', 'Other thanResidential'])
1. Highest count of Properties as per data in KAMWOKYA II
2. But Higest Revenue is from Perish "Civic Center". All Properties Under Civic Centre are of Usage Type = "Other than Residential'
3. Other than Residential Properties generating Revenue for KCCA
plt.figure(figsize = (15, 6))
prop_merge.groupby(['Perish','Village','Usage_Type'])['Property_Tax'].sum().sort_values(ascending=False)[:10].plot(kind='bar', color='g')
plt.ylabel('Property Tax')
sns.catplot(x='Usage_Type', y='Property_Tax', hue='Perish', data=prop_merge, height=30, kind='bar',
palette='spring')
1. Commercial Properties has major impact on KCCA Revenue
2. CIVIC CNTER Perish and Christ The King Zone Village is Top to contribute Maximum Revenue with Commercial Properties
3. Special & Institutional Proeprties also showing spike in Property Tax as compare to other Usage type.
# checking data where Property Organisation is not NULL and usgae of Property is Residential.
# Residential Proeprty can have Owner Names but not Organisation name
is_commercial = prop_merge[(prop_merge.Property_Org_Name.notnull()) & (prop_merge.Usage_Type == 'RESIDENTIALRENTED')]
is_commercial.count() # Count of Commerical Properties where Organisation name in data not captured
is_commercial.head()
plt.figure(figsize = (15, 6))
is_commercial.groupby('Perish')['Property_No._'].count().sort_values(ascending=False).plot(kind='bar', color='g')
plt.ylabel('Count')
1. There are 503 record, where Usage is residential but organisation name is mentioned.
2. KCCA needs to rectify such records, as if Orgnisation name is there means Proeprty usage must be other than
Residential OR may be usage type wrongly entered.
3. KOLOLO IV has maximum such records.
prop_merge.iloc[prop_merge['Property_Tax'].sort_values(ascending=False)[:10].index, :]
higest_taxpayer = prop_merge.iloc[prop_merge['Property_Tax'].sort_values(ascending=False)[:10].index, :]
plt.figure(figsize = (15, 6))
higest_taxpayer.groupby('Perish')['Property_Tax'].sum().sort_values(ascending=False).plot(kind='bar', color='g')
plt.title("Perishwise Revenue from Top 10 Properties")
plt.ylabel("Property Tax")
print(higest_taxpayer.groupby('Perish')['Property_Tax'].sum())
higest_taxpayer = prop_merge.iloc[prop_merge['Property_Tax'].sort_values(ascending=False)[:10].index, :]
plt.figure(figsize = (15, 6))
higest_taxpayer.groupby('Usage_Type')['Property_Tax'].sum().sort_values(ascending=False).plot(kind='bar', color='g')
plt.title("Usage wise Revenue from Top 10 Properties")
plt.ylabel("Property Tax")
higest_taxpayer_carpetarea = prop_merge.iloc[higest_taxpayer['Carpet_Area_'].sort_values(ascending=False)[:10].index, :]
plt.figure(figsize = (15, 6))
higest_taxpayer.groupby('Carpet_Area_')['Property_Tax'].sum().plot(kind='bar', color='g')
plt.title("Carpet Area wise Revenue from Top 10 Properties")
plt.ylabel("Property Tax")
higest_taxpayer_carpetarea.loc[:,['Carpet_Area_', 'Perish','Usage_Type','Avg._Property_Rate_(per_SQM)']]
higest_taxpayer_carpetarea[higest_taxpayer_carpetarea['Property_No._'].isin(["P117014445","P117005985"])]
1. Highest Tax payer are from Perish Civic Center with total tax = 548377830 UGX
2. Higest Tax Payer are of Usage type = Commerical and Institutional only. Mainly Commercial properties are paying
higest tax
3. Within Highest tax Payer, proeprty with less carpet area paying more than property with more carpet area, even though property is of same Usage type(Commercial) and in the same area (Perish = Civic Center)
4. Property rate per Square meter is more, almost double, even though having carpet area less.
Carpet_Area_ Perish Usage_Type Avg._Property_Rate_(per_SQM) Property_No._
16300.0 CIVIC CENTER COMMERCIAL 6740.0 P117014445
7005.0 CIVIC CENTER COMMERCIAL 16160.0 P117005985
May be some other factor affecting the valuation process. Let's see What are the other factors, are they varying/ different for this properties.
futher it is observed that only Tyoe of wall - finish, Type of Ceiling are different beside Carpet Area.
RV is likely impacting due to Type of wall finish, type of ceiling, because of these factors RV is high eventhough Carpet Area is less.
plt.figure(figsize=(20, 20))
sns.boxplot(data=prop_merge, x='Perish', y='RV', palette='viridis', width=1.0)
plt.xticks(rotation=90)
factor_info =['Type_of_Access_(X2)','Neighbourhood_Status_(X3)','Building_Condition_(X4)','Type_of_Walls_(X5)','Type_of_Wall_finish_(X6)','Type_of_Floor_Finish_(X7)','Type_of_Ceiling_(X8)','NWSC_Availabilty_(X9)','Hydro_Power_Availabilty_(X10)','Type_of_Saniation_Facility_(X11)','Availabilty_of_Parking_Space_(X12)','Type_of_Security_(X13)','Type_of_Boundary_Wall_(X14)'
]
for value in factor_info:
plt.figure(figsize=(15,10))
prop_merge[prop_merge['Usage_Type'] =='RESIDENTIALRENTED'].groupby([value])['RV'].mean().plot(kind='bar',
color='blue',fontsize =13 ,)
prop_merge[prop_merge['Usage_Type'] =='COMMERCIAL'].groupby([value])['RV'].mean().plot(kind='bar', color='green',fontsize=13,)
prop_merge[prop_merge['Usage_Type'] =='INSTITUTIONAL'].groupby([value])['RV'].mean().plot(kind='bar', color='grey',fontsize=13,)
prop_merge[prop_merge['Usage_Type'] =='INDUSTRIAL'].groupby([value])['RV'].mean().plot(kind='bar', color='pink',fontsize=13,)
prop_merge[prop_merge['Usage_Type'] =='SPECIAL'].groupby([value])['RV'].mean().plot(kind='bar', color='red',fontsize=13,)
plt.ylabel('RV')
plt.xlabel(value)
plt.legend(['Residential', 'COMMERCIAL','INSTITUTIONAL','INDUSTRIAL', 'SPECIAL'])
prop_new = prop_merge.copy(deep=True) # Creating New dataframe for correlation analysis, as need to drop few fields
prop_new.head()
prop_new.drop(['House_No', 'House_/_Apartment_Name', 'Street_/_Road','Area_/_Locality','Property_Org_Name', 'Contact_No', 'COIN','Property_Address','Latitude','Longitude','Frontage','Property_Tax','Avg._Property_Rate_(per_SQM)'], 1, inplace=True)
prop_new.drop(['Property_Tax','Avg._Property_Rate_(per_SQM)'], 1, inplace=True)
prop_new.drop(['Property_Type','Property_SubType','Occupency_Type_'], 1, inplace=True)
prop_new.drop(['Occupency_Type_'], 1, inplace=True)
prop_new.head()
from sklearn.preprocessing import LabelEncoder # Label encoder is used to transform non-numerical labels to numerical labels.
number = LabelEncoder() # Here as our data set consists only of categorical values except age
for i in prop_new.columns: # in order to find the correlation between variables, we do Label encoding
prop_new[i] = number.fit_transform(prop_new[i].astype('str'))
corr=prop_new.corr()['RV']
corr[np.argsort(corr,axis=0)[::-1]]
features_correlation = prop_new.corr()
plt.figure(figsize=(15,15))
sns.heatmap(features_correlation,vmax=1,square=True,annot=False,cmap='Purples')
plt.show()
plt.figure(figsize=(25,25))
sns.heatmap( prop_new.corr(), annot=True );
plt.show()
prop_new.corr()
1. Survey data of Central Division has given detailed about property valuation
2. Data has disparity, Logical mismatch in Data Observed which impacting Valuation of Data
3. Valuation RV as compare to various Factors is not correlated
4. Many Factors considered for Valuation showing -ve Correlation
1. In Order to increase the Revenue of KCCA,they need to cross check the data with factors collected properly.
2. KCCA need to check the Factors considered for Valuation, as many showing -ve Correlation between each other
2. Need to standardise the Valuation formula based on Various factors of Properties